Problem 1: univariate and unsupervised analysis (20 points)

Download and read “Census Income” data into R and prepare graphical and numerical summaries of it: e.g. histograms of continuous attributes, contingency tables of categorical variables, scatterplots of continuous attributes with some of the categorical variables indicated by color/symbol shape, etc. Perform principal components analysis of this data (do you need to scale it prior to that? how would you represent multilevel categorical attributes to be used as inputs for PCA?) and plot observations in the space of the first few principal components with subjects’ gender and/or categorized income indicated by color/shape of the symbol. Perform univariate assessment of associations between outcome we will be modeling and each of the attributes (e.g. t-test or logistic regression for continuous attributes, contingency tables/Fisher exact test/\(\chi^2\) test for categorical attributes). Summarize your observations from these assessments: does it appear that there is association between outcome and predictors? Which predictors seem to be more/less relevant?

adultDat <- read.csv("adult.data", header = FALSE)

#Skipping first row as it contains a strange string
adultTest <- read.csv("adult.test", skip=1, header=FALSE)

#Take a peek
head(adultDat)
##   V1                V2     V3         V4 V5                  V6
## 1 39         State-gov  77516  Bachelors 13       Never-married
## 2 50  Self-emp-not-inc  83311  Bachelors 13  Married-civ-spouse
## 3 38           Private 215646    HS-grad  9            Divorced
## 4 53           Private 234721       11th  7  Married-civ-spouse
## 5 28           Private 338409  Bachelors 13  Married-civ-spouse
## 6 37           Private 284582    Masters 14  Married-civ-spouse
##                   V7             V8     V9     V10  V11 V12 V13
## 1       Adm-clerical  Not-in-family  White    Male 2174   0  40
## 2    Exec-managerial        Husband  White    Male    0   0  13
## 3  Handlers-cleaners  Not-in-family  White    Male    0   0  40
## 4  Handlers-cleaners        Husband  Black    Male    0   0  40
## 5     Prof-specialty           Wife  Black  Female    0   0  40
## 6    Exec-managerial           Wife  White  Female    0   0  40
##              V14    V15
## 1  United-States  <=50K
## 2  United-States  <=50K
## 3  United-States  <=50K
## 4  United-States  <=50K
## 5           Cuba  <=50K
## 6  United-States  <=50K
head(adultTest)
##   V1         V2     V3            V4 V5                  V6
## 1 25    Private 226802          11th  7       Never-married
## 2 38    Private  89814       HS-grad  9  Married-civ-spouse
## 3 28  Local-gov 336951    Assoc-acdm 12  Married-civ-spouse
## 4 44    Private 160323  Some-college 10  Married-civ-spouse
## 5 18          ? 103497  Some-college 10       Never-married
## 6 34    Private 198693          10th  6       Never-married
##                   V7             V8     V9     V10  V11 V12 V13
## 1  Machine-op-inspct      Own-child  Black    Male    0   0  40
## 2    Farming-fishing        Husband  White    Male    0   0  50
## 3    Protective-serv        Husband  White    Male    0   0  40
## 4  Machine-op-inspct        Husband  Black    Male 7688   0  40
## 5                  ?      Own-child  White  Female    0   0  30
## 6      Other-service  Not-in-family  White    Male    0   0  30
##              V14     V15
## 1  United-States  <=50K.
## 2  United-States  <=50K.
## 3  United-States   >50K.
## 4  United-States   >50K.
## 5  United-States  <=50K.
## 6  United-States  <=50K.
#Assign names, taken from the corresponding txt files
adultNames <- unique(str_match(readLines("old.adult.names.txt"), "^[a-zA-Z-]+"))
adultNames <- adultNames[!is.na(adultNames)]

#Replace dashes with dots for easier reference
adultNames <-gsub('-','.',adultNames)

#Assign names 
colnames(adultDat) <- adultNames
head(adultDat)
##   age         workclass fnlwgt  education education.num
## 1  39         State-gov  77516  Bachelors            13
## 2  50  Self-emp-not-inc  83311  Bachelors            13
## 3  38           Private 215646    HS-grad             9
## 4  53           Private 234721       11th             7
## 5  28           Private 338409  Bachelors            13
## 6  37           Private 284582    Masters            14
##        marital.status         occupation   relationship   race     sex
## 1       Never-married       Adm-clerical  Not-in-family  White    Male
## 2  Married-civ-spouse    Exec-managerial        Husband  White    Male
## 3            Divorced  Handlers-cleaners  Not-in-family  White    Male
## 4  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
## 5  Married-civ-spouse     Prof-specialty           Wife  Black  Female
## 6  Married-civ-spouse    Exec-managerial           Wife  White  Female
##   capital.gain capital.loss hours.per.week native.country  class
## 1         2174            0             40  United-States  <=50K
## 2            0            0             13  United-States  <=50K
## 3            0            0             40  United-States  <=50K
## 4            0            0             40  United-States  <=50K
## 5            0            0             40           Cuba  <=50K
## 6            0            0             40  United-States  <=50K
colnames(adultTest) <- adultNames
head(adultTest)
##   age  workclass fnlwgt     education education.num      marital.status
## 1  25    Private 226802          11th             7       Never-married
## 2  38    Private  89814       HS-grad             9  Married-civ-spouse
## 3  28  Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  44    Private 160323  Some-college            10  Married-civ-spouse
## 5  18          ? 103497  Some-college            10       Never-married
## 6  34    Private 198693          10th             6       Never-married
##           occupation   relationship   race     sex capital.gain
## 1  Machine-op-inspct      Own-child  Black    Male            0
## 2    Farming-fishing        Husband  White    Male            0
## 3    Protective-serv        Husband  White    Male            0
## 4  Machine-op-inspct        Husband  Black    Male         7688
## 5                  ?      Own-child  White  Female            0
## 6      Other-service  Not-in-family  White    Male            0
##   capital.loss hours.per.week native.country   class
## 1            0             40  United-States  <=50K.
## 2            0             50  United-States  <=50K.
## 3            0             40  United-States   >50K.
## 4            0             40  United-States   >50K.
## 5            0             30  United-States  <=50K.
## 6            0             30  United-States  <=50K.
#Check that numerics are read in correctly and characters read as factors
summary(adultDat)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education.num                  marital.status 
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##             occupation            relationship  
##   Prof-specialty :4140    Husband       :13193  
##   Craft-repair   :4099    Not-in-family : 8305  
##   Exec-managerial:4066    Other-relative:  981  
##   Adm-clerical   :3770    Own-child     : 5068  
##   Sales          :3650    Unmarried     : 3446  
##   Other-service  :3295    Wife          : 1568  
##  (Other)         :9541                          
##                   race            sex         capital.gain  
##   Amer-Indian-Eskimo:  311    Female:10771   Min.   :    0  
##   Asian-Pac-Islander: 1039    Male  :21790   1st Qu.:    0  
##   Black             : 3124                   Median :    0  
##   Other             :  271                   Mean   : 1078  
##   White             :27816                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital.loss    hours.per.week         native.country     class      
##  Min.   :   0.0   Min.   : 1.00    United-States:29170    <=50K:24720  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  643    >50K : 7841  
##  Median :   0.0   Median :40.00    ?            :  583                 
##  Mean   :  87.3   Mean   :40.44    Philippines  :  198                 
##  3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  137                 
##  Max.   :4356.0   Max.   :99.00    Canada       :  121                 
##                                   (Other)       : 1709
summary(adultTest)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :11210   Min.   :  13492  
##  1st Qu.:28.00    Self-emp-not-inc: 1321   1st Qu.: 116736  
##  Median :37.00    Local-gov       : 1043   Median : 177831  
##  Mean   :38.77    ?               :  963   Mean   : 189436  
##  3rd Qu.:48.00    State-gov       :  683   3rd Qu.: 238384  
##  Max.   :90.00    Self-emp-inc    :  579   Max.   :1490400  
##                  (Other)          :  482                    
##          education    education.num                  marital.status
##   HS-grad     :5283   Min.   : 1.00    Divorced             :2190  
##   Some-college:3587   1st Qu.: 9.00    Married-AF-spouse    :  14  
##   Bachelors   :2670   Median :10.00    Married-civ-spouse   :7403  
##   Masters     : 934   Mean   :10.07    Married-spouse-absent: 210  
##   Assoc-voc   : 679   3rd Qu.:12.00    Never-married        :5434  
##   11th        : 637   Max.   :16.00    Separated            : 505  
##  (Other)      :2491                    Widowed              : 525  
##             occupation            relationship 
##   Prof-specialty :2032    Husband       :6523  
##   Exec-managerial:2020    Not-in-family :4278  
##   Craft-repair   :2013    Other-relative: 525  
##   Sales          :1854    Own-child     :2513  
##   Adm-clerical   :1841    Unmarried     :1679  
##   Other-service  :1628    Wife          : 763  
##  (Other)         :4893                         
##                   race            sex         capital.gain  
##   Amer-Indian-Eskimo:  159    Female: 5421   Min.   :    0  
##   Asian-Pac-Islander:  480    Male  :10860   1st Qu.:    0  
##   Black             : 1561                   Median :    0  
##   Other             :  135                   Mean   : 1082  
##   White             :13946                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital.loss    hours.per.week         native.country      class      
##  Min.   :   0.0   Min.   : 1.00    United-States:14662    <=50K.:12435  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  308    >50K. : 3846  
##  Median :   0.0   Median :40.00    ?            :  274                  
##  Mean   :  87.9   Mean   :40.39    Philippines  :   97                  
##  3rd Qu.:   0.0   3rd Qu.:45.00    Puerto-Rico  :   70                  
##  Max.   :3770.0   Max.   :99.00    Germany      :   69                  
##                                   (Other)       :  801
#From here we can see that the numerics values are clean (no NAs). 
#Will transpose and merge the unique values of each character variable
#to see what may need cleaning before we concatenate the datasets
sapply(adultDat,class)
##            age      workclass         fnlwgt      education  education.num 
##      "integer"       "factor"      "integer"       "factor"      "integer" 
## marital.status     occupation   relationship           race            sex 
##       "factor"       "factor"       "factor"       "factor"       "factor" 
##   capital.gain   capital.loss hours.per.week native.country          class 
##      "integer"      "integer"      "integer"       "factor"       "factor"
#Store bool of Character and Numeric variables
charVect <- sapply(adultDat, class) %in% c('character', 'factor')
numVect <- sapply(adultDat, class) %in% c('integer')

#Distinct categorical data in each data frame
adultLong <- melt(as.matrix(adultDat[,charVect])) %>% group_by(Var2,value) %>% summarise(n.adult=n_distinct(value)) 
testLong <- melt(as.matrix(adultTest[,charVect])) %>% group_by(Var2,value) %>% summarise(n.test=n_distinct(value)) 

#Values that will need to be investigated
merge(adultLong,testLong, by=c("Var2","value"),all=TRUE) %>% filter(is.na((n.adult+n.test)))
##             Var2               value n.adult n.test
## 1 native.country  Holand-Netherlands       1     NA
## 2          class               <=50K       1     NA
## 3          class                >50K       1     NA
## 4          class              <=50K.      NA      1
## 5          class               >50K.      NA      1
#Holland-Netherlands is a valid country name, so will just compress the . from the class variable in the test dataset.
adultTest$class <- gsub('[.]','',adultTest$class)

#We can also see that we have some missing values in the following variables that we will have to look into imputing
adultLong[adultLong$value==' ?',1]
## Source: local data frame [3 x 1]
## Groups: Var2 [3]
## 
##             Var2
##           <fctr>
## 1      workclass
## 2     occupation
## 3 native.country
#Join our cleaned datasets
incomeDat <- rbind(adultDat,adultTest)

#Checking that the class of variables is preserved
summary(incomeDat)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :33906   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 3862   1st Qu.: 117550  
##  Median :37.00    Local-gov       : 3136   Median : 178144  
##  Mean   :38.64    ?               : 2799   Mean   : 189664  
##  3rd Qu.:48.00    State-gov       : 1981   3rd Qu.: 237642  
##  Max.   :90.00    Self-emp-inc    : 1695   Max.   :1490400  
##                  (Other)          : 1463                    
##          education     education.num                  marital.status 
##   HS-grad     :15784   Min.   : 1.00    Divorced             : 6633  
##   Some-college:10878   1st Qu.: 9.00    Married-AF-spouse    :   37  
##   Bachelors   : 8025   Median :10.00    Married-civ-spouse   :22379  
##   Masters     : 2657   Mean   :10.08    Married-spouse-absent:  628  
##   Assoc-voc   : 2061   3rd Qu.:12.00    Never-married        :16117  
##   11th        : 1812   Max.   :16.00    Separated            : 1530  
##  (Other)      : 7625                    Widowed              : 1518  
##             occupation             relationship  
##   Prof-specialty : 6172    Husband       :19716  
##   Craft-repair   : 6112    Not-in-family :12583  
##   Exec-managerial: 6086    Other-relative: 1506  
##   Adm-clerical   : 5611    Own-child     : 7581  
##   Sales          : 5504    Unmarried     : 5125  
##   Other-service  : 4923    Wife          : 2331  
##  (Other)         :14434                          
##                   race            sex         capital.gain  
##   Amer-Indian-Eskimo:  470    Female:16192   Min.   :    0  
##   Asian-Pac-Islander: 1519    Male  :32650   1st Qu.:    0  
##   Black             : 4685                   Median :    0  
##   Other             :  406                   Mean   : 1079  
##   White             :41762                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital.loss    hours.per.week         native.country     class      
##  Min.   :   0.0   Min.   : 1.00    United-States:43832    <=50K:37155  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  951    >50K :11687  
##  Median :   0.0   Median :40.00    ?            :  857                 
##  Mean   :  87.5   Mean   :40.42    Philippines  :  295                 
##  3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  206                 
##  Max.   :4356.0   Max.   :99.00    Puerto-Rico  :  184                 
##                                   (Other)       : 2517
#Class looks to fixed now
missWrkClss <- incomeDat[incomeDat$workclass==' ?',]
missOcupatn <- incomeDat[incomeDat$occupation==' ?',]
missNatCntr <- incomeDat[incomeDat$native.country==' ?',]

#For the time being, going to replace vars with ? with NA
incomeDat[,charVect] <- apply(incomeDat[,charVect],2,function(x) gsub(' [?]',NA,x))

#Remove the leading spaces from the character variables
incomeDat[,charVect] <- apply(incomeDat[,charVect],2,function(x) gsub('^ ','',x))

#Converting back to factor
incomeDat <- as.data.frame(unclass(incomeDat))

#Remove all records with NA
nIncomeDat <- na.omit(incomeDat)

#Lets see how much data we lose by removing NAs
(nrow(incomeDat)-nrow(nIncomeDat))/nrow(incomeDat)*100
## [1] 7.411654
#We lose about 7.5% of the data, however it seems more hassle than its worth to impute values for this amount of data, so will continue with these. Especially since we cannot guarantee whether the missings are random or if there is an underlying pattern.

#Dataset is too big for testing, taking a random sample of 2k 
sampIncome <- data.frame(nIncomeDat[sample(1:nrow(nIncomeDat), 2000,
    replace=FALSE),])

#Categorical data
#Contingency table
nIncomeDat[,charVect] %>% melt(id="class") %>% group_by(variable,value,class) %>% summarise(n=n()) %>% dcast(variable+value~class,value.var="n") %>% kable()
## Warning: attributes are not identical across measure variables; they will
## be dropped
variable value <=50K >50K
workclass Federal-gov 857 549
workclass Local-gov 2185 915
workclass Private 26056 7251
workclass Self-emp-inc 734 912
workclass Self-emp-not-inc 2737 1059
workclass State-gov 1426 520
workclass Without-pay 19 2
education 10th 1141 82
education 11th 1530 89
education 12th 534 43
education 1st-4th 214 8
education 5th-6th 427 22
education 7th-8th 768 55
education 9th 638 38
education Assoc-acdm 1109 398
education Assoc-voc 1455 504
education Bachelors 4392 3178
education Doctorate 145 399
education HS-grad 12367 2416
education Masters 1121 1393
education Preschool 71 1
education Prof-school 193 592
education Some-college 7909 1990
marital.status Divorced 5642 655
marital.status Married-AF-spouse 18 14
marital.status Married-civ-spouse 11491 9564
marital.status Married-spouse-absent 498 54
marital.status Never-married 13897 701
marital.status Separated 1312 99
marital.status Widowed 1156 121
occupation Adm-clerical 4784 756
occupation Armed-Forces 10 4
occupation Craft-repair 4665 1355
occupation Exec-managerial 3117 2867
occupation Farming-fishing 1308 172
occupation Handlers-cleaners 1911 135
occupation Machine-op-inspct 2605 365
occupation Other-service 4612 196
occupation Priv-house-serv 229 3
occupation Prof-specialty 3304 2704
occupation Protective-serv 669 307
occupation Sales 3953 1455
occupation Tech-support 1009 411
occupation Transport-moving 1838 478
relationship Husband 10159 8507
relationship Not-in-family 10474 1228
relationship Other-relative 1299 50
relationship Own-child 6521 105
relationship Unmarried 4486 302
relationship Wife 1075 1016
race Amer-Indian-Eskimo 382 53
race Asian-Pac-Islander 934 369
race Black 3694 534
race Other 308 45
race White 28696 10207
sex Female 13026 1669
sex Male 20988 9539
native.country Cambodia 17 9
native.country Canada 103 60
native.country China 77 36
native.country Columbia 78 4
native.country Cuba 99 34
native.country Dominican-Republic 92 5
native.country Ecuador 37 6
native.country El-Salvador 136 11
native.country England 72 47
native.country France 20 16
native.country Germany 135 58
native.country Greece 31 18
native.country Guatemala 83 3
native.country Haiti 60 9
native.country Holand-Netherlands 1 NA
native.country Honduras 17 2
native.country Hong 20 8
native.country Hungary 12 6
native.country India 85 62
native.country Iran 34 22
native.country Ireland 26 10
native.country Italy 67 33
native.country Jamaica 89 14
native.country Japan 58 31
native.country Laos 19 2
native.country Mexico 856 47
native.country Nicaragua 45 3
native.country Outlying-US(Guam-USVI-etc) 21 1
native.country Peru 41 4
native.country Philippines 199 84
native.country Poland 65 16
native.country Portugal 50 12
native.country Puerto-Rico 155 20
native.country Scotland 18 2
native.country South 83 18
native.country Taiwan 30 25
native.country Thailand 24 5
native.country Trinadad&Tobago 24 2
native.country United-States 30844 10448
native.country Vietnam 76 7
native.country Yugoslavia 15 8
#We have a lot of granularity in the categorical data which we could do without

#Looking at the education level and education num, it looks like we dont need both education and education.num. Furthermore its prob better to collapse the levels also and turn it into dummy variables if possible
table(sampIncome$education,sampIncome$education.num)
##               
##                  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
##   10th           0   0   0   0   0  45   0   0   0   0   0   0   0   0   0
##   11th           0   0   0   0   0   0  78   0   0   0   0   0   0   0   0
##   12th           0   0   0   0   0   0   0  23   0   0   0   0   0   0   0
##   1st-4th        0  11   0   0   0   0   0   0   0   0   0   0   0   0   0
##   5th-6th        0   0  22   0   0   0   0   0   0   0   0   0   0   0   0
##   7th-8th        0   0   0  35   0   0   0   0   0   0   0   0   0   0   0
##   9th            0   0   0   0  36   0   0   0   0   0   0   0   0   0   0
##   Assoc-acdm     0   0   0   0   0   0   0   0   0   0   0  67   0   0   0
##   Assoc-voc      0   0   0   0   0   0   0   0   0   0  98   0   0   0   0
##   Bachelors      0   0   0   0   0   0   0   0   0   0   0   0 325   0   0
##   Doctorate      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   HS-grad        0   0   0   0   0   0   0   0 642   0   0   0   0   0   0
##   Masters        0   0   0   0   0   0   0   0   0   0   0   0   0 102   0
##   Preschool      2   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   Prof-school    0   0   0   0   0   0   0   0   0   0   0   0   0   0  36
##   Some-college   0   0   0   0   0   0   0   0   0 456   0   0   0   0   0
##               
##                 16
##   10th           0
##   11th           0
##   12th           0
##   1st-4th        0
##   5th-6th        0
##   7th-8th        0
##   9th            0
##   Assoc-acdm     0
##   Assoc-voc      0
##   Bachelors      0
##   Doctorate     22
##   HS-grad        0
##   Masters        0
##   Preschool      0
##   Prof-school    0
##   Some-college   0
table(sampIncome$marital.status,sampIncome$relationship)
##                        
##                         Husband Not-in-family Other-relative Own-child
##   Divorced                    0           145              7        12
##   Married-AF-spouse           0             0              0         0
##   Married-civ-spouse        816             1              5         8
##   Married-spouse-absent       0            13              1         1
##   Never-married               0           317             43       258
##   Separated                   0            25              4         5
##   Widowed                     0            21              3         0
##                        
##                         Unmarried Wife
##   Divorced                    117    0
##   Married-AF-spouse             0    0
##   Married-civ-spouse            0   93
##   Married-spouse-absent         4    0
##   Never-married                44    0
##   Separated                    34    0
##   Widowed                      23    0
#Bar char colored by outcome
sampIncome[,charVect] %>% melt(id.vars="class") %>% 
  ggplot(aes(x=value,fill=class)) + geom_bar() + facet_wrap(~variable, scales = "free") 
## Warning: attributes are not identical across measure variables; they will
## be dropped

#Performing Chi-sq test on the whole dataset to see which variables are independent of income class
lapply(nIncomeDat[,charVect], function(x) {
chisq.test(x, nIncomeDat$class)
})  
## Warning in chisq.test(x, nIncomeDat$class): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(x, nIncomeDat$class): Chi-squared approximation may
## be incorrect
## $workclass
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 1207.3, df = 6, p-value < 2.2e-16
## 
## 
## $education
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 5996, df = 15, p-value < 2.2e-16
## 
## 
## $marital.status
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 9109.2, df = 6, p-value < 2.2e-16
## 
## 
## $occupation
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 5415.1, df = 13, p-value < 2.2e-16
## 
## 
## $relationship
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 9357, df = 5, p-value < 2.2e-16
## 
## 
## $race
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 452.3, df = 4, p-value < 2.2e-16
## 
## 
## $sex
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x and nIncomeDat$class
## X-squared = 2104.1, df = 1, p-value < 2.2e-16
## 
## 
## $native.country
## 
##  Pearson's Chi-squared test
## 
## data:  x and nIncomeDat$class
## X-squared = 456.01, df = 40, p-value < 2.2e-16
## 
## 
## $class
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x and nIncomeDat$class
## X-squared = 45217, df = 1, p-value < 2.2e-16
#It seems that all variables are related to the outcome variable with a very strong significance



#Continuous variables
#Creates color pallette based on unique column values
#taken from ISLR
Cols=function(vec){
  cols=rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))]) 
}

#Scatterplot matrix, colored by income class
pairs(~.,data=sampIncome[,numVect], 
        col=Cols(sampIncome$class))

#Here age seems to be correlated with weight (fnlwgt) as is expected


#Histogram plot
as.matrix(sampIncome[,numVect]) %>% melt() %>% mutate(value=as.numeric(value)) %>% ggplot(aes(x=value)) + geom_histogram() + facet_wrap(~Var2, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

lapply(nIncomeDat[,numVect], function(x) t.test(x ~ nIncomeDat$class, var.equal = TRUE))
## $age
## 
##  Two Sample t-test
## 
## data:  x by nIncomeDat$class
## t = -51.885, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.530767 -6.982514
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            36.74943            44.00607 
## 
## 
## $fnlwgt
## 
##  Two Sample t-test
## 
## data:  x by nIncomeDat$class
## t = 1.5447, df = 45220, p-value = 0.1224
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -477.8235 4032.3173
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            190175.2            188398.0 
## 
## 
## $education.num
## 
##  Two Sample t-test
## 
## data:  x by nIncomeDat$class
## t = -75.048, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.019122 -1.916340
## sample estimates:
## mean in group <=50K  mean in group >50K 
##             9.63077            11.59850 
## 
## 
## $capital.gain
## 
##  Two Sample t-test
## 
## data:  x by nIncomeDat$class
## t = -48.195, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3999.048 -3686.488
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            149.0234           3991.7917 
## 
## 
## $capital.loss
## 
##  Two Sample t-test
## 
## data:  x by nIncomeDat$class
## t = -31.974, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -148.0033 -130.9059
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            54.03243           193.48706 
## 
## 
## $hours.per.week
## 
##  Two Sample t-test
## 
## data:  x by nIncomeDat$class
## t = -49.611, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.568092 -6.068839
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            39.37202            45.69049
#fnlWgt is the only variable that has no relationship with the outcome variable. We may consider dropping this


nvClass <- c(numVect[1:14],TRUE)

sampIncome[,nvClass] %>% melt(id.vars="class") %>% ggplot(aes(x=variable, y=value,group=class,fill=class)) + geom_boxplot() + facet_wrap(~variable, scales = "free")

#drop fnlwgt
#log transform age
#drop education.num
#pre-college (1:10), 

plot(nIncomeDat$education.num,col=Cols(nIncomeDat$class))

#Although education matters, it doesnt seem to be a large distinction prior to high school


cIncomeDat <- nIncomeDat
cIncomeDat$education <- as.character(cIncomeDat$education)
cIncomeDat$education[cIncomeDat$education.num <=10] <- "Pre-College"
cIncomeDat$education<- as.factor(cIncomeDat$education)

#Drop education num
cIncomeDat$education.num <- NULL

#Drop final weight
cIncomeDat$fnlwgt <- NULL

cIncomeDat$age <- log(cIncomeDat$age)

#Recoding countries to regions
cIncomeDat$region <- countrycode(cIncomeDat$native.country,"country.name","region")
## Warning in countrycode(cIncomeDat$native.country, "country.name", "region"): Some values were not matched unambiguously: Columbia, England, Hong, Scotland, South, Yugoslavia
#Some issues converting columbia, england, hong, scotland, south and yugoslavia
#need to fix these and try again

cIncomeDat$native.country <- as.character(cIncomeDat$native.country)
cIncomeDat$native.country[cIncomeDat$native.country=="Columbia"] <- "Colombia"
cIncomeDat$native.country[cIncomeDat$native.country=="Hong"] <- "Hong Kong"
cIncomeDat$native.country[cIncomeDat$native.country %in% c("England","Scotland")] <- "UK"

#For the sake of argument, just picking one
cIncomeDat$native.country[cIncomeDat$native.country %in% c("Yugoslavia")] <- "Bosnia"

cIncomeDat <- cIncomeDat[cIncomeDat$native.country!="South",]
cIncomeDat$region <-( countrycode(cIncomeDat$native.country,"country.name","region"))
cIncomeDat$region <- as.factor(cIncomeDat$region)
cIncomeDat$native.country <- NULL

#Creating our data matrix that will be used for the rest of the analysis
nrow(cIncomeDat)
## [1] 45121
dm <- model.matrix(~.+0, data=cIncomeDat)

#We have one variable (workclass never worked) with no values, we will remove this
dm <- dm[,-4]

summary(dm)
##       age        workclassFederal-gov workclassLocal-gov workclassPrivate
##  Min.   :2.833   Min.   :0.00000      Min.   :0.0000     Min.   :0.0000  
##  1st Qu.:3.332   1st Qu.:0.00000      1st Qu.:0.0000     1st Qu.:0.0000  
##  Median :3.611   Median :0.00000      Median :0.0000     Median :1.0000  
##  Mean   :3.592   Mean   :0.03112      Mean   :0.0687     Mean   :0.7369  
##  3rd Qu.:3.850   3rd Qu.:0.00000      3rd Qu.:0.0000     3rd Qu.:1.0000  
##  Max.   :4.500   Max.   :1.00000      Max.   :1.0000     Max.   :1.0000  
##  workclassSelf-emp-inc workclassSelf-emp-not-inc workclassState-gov
##  Min.   :0.00000       Min.   :0.00000           Min.   :0.00000   
##  1st Qu.:0.00000       1st Qu.:0.00000           1st Qu.:0.00000   
##  Median :0.00000       Median :0.00000           Median :0.00000   
##  Mean   :0.03621       Mean   :0.08351           Mean   :0.04311   
##  3rd Qu.:0.00000       3rd Qu.:0.00000           3rd Qu.:0.00000   
##  Max.   :1.00000       Max.   :1.00000           Max.   :1.00000   
##  workclassWithout-pay educationAssoc-voc educationBachelors
##  Min.   :0.0000000    Min.   :0.00000    Min.   :0.0000    
##  1st Qu.:0.0000000    1st Qu.:0.00000    1st Qu.:0.0000    
##  Median :0.0000000    Median :0.00000    Median :0.0000    
##  Mean   :0.0004654    Mean   :0.04335    Mean   :0.1671    
##  3rd Qu.:0.0000000    3rd Qu.:0.00000    3rd Qu.:0.0000    
##  Max.   :1.0000000    Max.   :1.00000    Max.   :1.0000    
##  educationDoctorate educationMasters  educationPre-College
##  Min.   :0.00000    Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:0.00000    1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :0.00000    Median :0.00000   Median :1.0000      
##  Mean   :0.01201    Mean   :0.05552   Mean   :0.6713      
##  3rd Qu.:0.00000    3rd Qu.:0.00000   3rd Qu.:1.0000      
##  Max.   :1.00000    Max.   :1.00000   Max.   :1.0000      
##  educationProf-school marital.statusMarried-AF-spouse
##  Min.   :0.00000      Min.   :0.0000000              
##  1st Qu.:0.00000      1st Qu.:0.0000000              
##  Median :0.00000      Median :0.0000000              
##  Mean   :0.01735      Mean   :0.0007092              
##  3rd Qu.:0.00000      3rd Qu.:0.0000000              
##  Max.   :1.00000      Max.   :1.0000000              
##  marital.statusMarried-civ-spouse marital.statusMarried-spouse-absent
##  Min.   :0.0000                   Min.   :0.00000                    
##  1st Qu.:0.0000                   1st Qu.:0.00000                    
##  Median :0.0000                   Median :0.00000                    
##  Mean   :0.4655                   Mean   :0.01217                    
##  3rd Qu.:1.0000                   3rd Qu.:0.00000                    
##  Max.   :1.0000                   Max.   :1.00000                    
##  marital.statusNever-married marital.statusSeparated marital.statusWidowed
##  Min.   :0.0000              Min.   :0.00000         Min.   :0.00000      
##  1st Qu.:0.0000              1st Qu.:0.00000         1st Qu.:0.00000      
##  Median :0.0000              Median :0.00000         Median :0.00000      
##  Mean   :0.3229              Mean   :0.03123         Mean   :0.02819      
##  3rd Qu.:1.0000              3rd Qu.:0.00000         3rd Qu.:0.00000      
##  Max.   :1.0000              Max.   :1.00000         Max.   :1.00000      
##  occupationArmed-Forces occupationCraft-repair occupationExec-managerial
##  Min.   :0.0000000      Min.   :0.0000         Min.   :0.0000           
##  1st Qu.:0.0000000      1st Qu.:0.0000         1st Qu.:0.0000           
##  Median :0.0000000      Median :0.0000         Median :0.0000           
##  Mean   :0.0003103      Mean   :0.1332         Mean   :0.1322           
##  3rd Qu.:0.0000000      3rd Qu.:0.0000         3rd Qu.:0.0000           
##  Max.   :1.0000000      Max.   :1.0000         Max.   :1.0000           
##  occupationFarming-fishing occupationHandlers-cleaners
##  Min.   :0.0000            Min.   :0.00000            
##  1st Qu.:0.0000            1st Qu.:0.00000            
##  Median :0.0000            Median :0.00000            
##  Mean   :0.0328            Mean   :0.04528            
##  3rd Qu.:0.0000            3rd Qu.:0.00000            
##  Max.   :1.0000            Max.   :1.00000            
##  occupationMachine-op-inspct occupationOther-service
##  Min.   :0.00000             Min.   :0.0000         
##  1st Qu.:0.00000             1st Qu.:0.0000         
##  Median :0.00000             Median :0.0000         
##  Mean   :0.06573             Mean   :0.1062         
##  3rd Qu.:0.00000             3rd Qu.:0.0000         
##  Max.   :1.00000             Max.   :1.0000         
##  occupationPriv-house-serv occupationProf-specialty
##  Min.   :0.000000          Min.   :0.0000          
##  1st Qu.:0.000000          1st Qu.:0.0000          
##  Median :0.000000          Median :0.0000          
##  Mean   :0.005142          Mean   :0.1328          
##  3rd Qu.:0.000000          3rd Qu.:0.0000          
##  Max.   :1.000000          Max.   :1.0000          
##  occupationProtective-serv occupationSales  occupationTech-support
##  Min.   :0.00000           Min.   :0.0000   Min.   :0.00000       
##  1st Qu.:0.00000           1st Qu.:0.0000   1st Qu.:0.00000       
##  Median :0.00000           Median :0.0000   Median :0.00000       
##  Mean   :0.02163           Mean   :0.1192   Mean   :0.03147       
##  3rd Qu.:0.00000           3rd Qu.:0.0000   3rd Qu.:0.00000       
##  Max.   :1.00000           Max.   :1.0000   Max.   :1.00000       
##  occupationTransport-moving relationshipNot-in-family
##  Min.   :0.00000            Min.   :0.000            
##  1st Qu.:0.00000            1st Qu.:0.000            
##  Median :0.00000            Median :0.000            
##  Mean   :0.05131            Mean   :0.259            
##  3rd Qu.:0.00000            3rd Qu.:1.000            
##  Max.   :1.00000            Max.   :1.000            
##  relationshipOther-relative relationshipOwn-child relationshipUnmarried
##  Min.   :0.00000            Min.   :0.0000        Min.   :0.0000       
##  1st Qu.:0.00000            1st Qu.:0.0000        1st Qu.:0.0000       
##  Median :0.00000            Median :0.0000        Median :0.0000       
##  Mean   :0.02968            Mean   :0.1465        Mean   :0.1058       
##  3rd Qu.:0.00000            3rd Qu.:0.0000        3rd Qu.:0.0000       
##  Max.   :1.00000            Max.   :1.0000        Max.   :1.0000       
##  relationshipWife  raceAsian-Pac-Islander   raceBlack     
##  Min.   :0.00000   Min.   :0.00000        Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000        1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000        Median :0.0000  
##  Mean   :0.04616   Mean   :0.02671        Mean   :0.0937  
##  3rd Qu.:0.00000   3rd Qu.:0.00000        3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000        Max.   :1.0000  
##    raceOther          raceWhite         sexMale        capital.gain  
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :    0  
##  1st Qu.:0.000000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:    0  
##  Median :0.000000   Median :1.0000   Median :1.0000   Median :    0  
##  Mean   :0.007823   Mean   :0.8622   Mean   :0.6752   Mean   : 1102  
##  3rd Qu.:0.000000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:    0  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :99999  
##   capital.loss     hours.per.week    class>50K     regionCentral America
##  Min.   :   0.00   Min.   : 1.00   Min.   :0.000   Min.   :0.00000      
##  1st Qu.:   0.00   1st Qu.:40.00   1st Qu.:0.000   1st Qu.:0.00000      
##  Median :   0.00   Median :40.00   Median :0.000   Median :0.00000      
##  Mean   :  88.54   Mean   :40.93   Mean   :0.248   Mean   :0.02666      
##  3rd Qu.:   0.00   3rd Qu.:45.00   3rd Qu.:0.000   3rd Qu.:0.00000      
##  Max.   :4356.00   Max.   :99.00   Max.   :1.000   Max.   :1.00000      
##  regionEastern Asia regionEastern Europe regionMicronesia   
##  Min.   :0.000000   Min.   :0.000000     Min.   :0.0000000  
##  1st Qu.:0.000000   1st Qu.:0.000000     1st Qu.:0.0000000  
##  Median :0.000000   Median :0.000000     Median :0.0000000  
##  Mean   :0.006316   Mean   :0.002194     Mean   :0.0004876  
##  3rd Qu.:0.000000   3rd Qu.:0.000000     3rd Qu.:0.0000000  
##  Max.   :1.000000   Max.   :1.000000     Max.   :1.0000000  
##  regionNorthern America regionNorthern Europe regionSouth America
##  Min.   :0.0000         Min.   :0.000000      Min.   :0.000000   
##  1st Qu.:1.0000         1st Qu.:0.000000      1st Qu.:0.000000   
##  Median :1.0000         Median :0.000000      Median :0.000000   
##  Mean   :0.9188         Mean   :0.003879      Mean   :0.003768   
##  3rd Qu.:1.0000         3rd Qu.:0.000000      3rd Qu.:0.000000   
##  Max.   :1.0000         Max.   :1.000000      Max.   :1.000000   
##  regionSouth-Eastern Asia regionSouthern Asia regionSouthern Europe
##  Min.   :0.000000         Min.   :0.000000    Min.   :0.000000     
##  1st Qu.:0.000000         1st Qu.:0.000000    1st Qu.:0.000000     
##  Median :0.000000         Median :0.000000    Median :0.000000     
##  Mean   :0.009796         Mean   :0.004499    Mean   :0.005186     
##  3rd Qu.:0.000000         3rd Qu.:0.000000    3rd Qu.:0.000000     
##  Max.   :1.000000         Max.   :1.000000    Max.   :1.000000     
##  regionWestern Europe
##  Min.   :0.000000    
##  1st Qu.:0.000000    
##  Median :0.000000    
##  Mean   :0.005097    
##  3rd Qu.:0.000000    
##  Max.   :1.000000
pr <- prcomp(dm,scale=TRUE)
par(mfrow=c(1,1))
biplot(pr,cex=.5)

#From the biplot, it looks like Sex, and Region contribute heavily to the second principal component

plot(pr)

#We can see that the first two principal components to a good job of explaining a lot of the variance. However, we see that the data may not be so easily separable after that

plot(pr$x[,1:2], col=Cols(cIncomeDat$class))

#The class seems to be rather well defined by the second principal component

plot(pr$x[,1:2], col=Cols(cIncomeDat$sex))

#Gender is rather well defined by the second principal component but not nearly as well separated as class

#Top 5 contributers to PC1 and PC2 respectively
sort(abs(pr$rotation[,1]),decreasing=T)[1:5]
## marital.statusMarried-civ-spouse      marital.statusNever-married 
##                        0.3779606                        0.3356694 
##                        class>50K                              age 
##                        0.3333022                        0.2952995 
##            relationshipOwn-child 
##                        0.2600073
sort(abs(pr$rotation[,2]),decreasing=T)[1:5]
##   raceAsian-Pac-Islander                raceWhite   regionNorthern America 
##                0.4339857                0.4026078                0.3656846 
## regionSouth-Eastern Asia       regionEastern Asia 
##                0.3109853                0.2436549
set.seed(5124)

#holding out 20% for the test set
gTrain <- sample(c(FALSE,TRUE,TRUE,TRUE,TRUE),nrow(cIncomeDat),replace=TRUE)

train <- cIncomeDat[gTrain,]
test <- cIncomeDat[!gTrain,]